The variance of identity-by-descent sharing in the Wright-Fisher model 
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Widespread sharing of long, identical-by-descent (IBD) genetic segments is a hallmark of popu- 
lations that have experienced recent genetic drift. Detection of these IBD segments has recently 
become feasible, enabling a wide range of applications from phasing and imputation to demographic 
inference. Here, we study the distribution of IBD sharing in the Wright-Fisher model. Specifically, 
using coalescent theory, we calculate the variance of the total sharing between random pairs of in- 
dividuals. We then investigate the cohort-averaged sharing: the average total sharing between one 
individual and the rest of the cohort. We find that for large cohorts, the cohort-averaged sharing 
is distributed approximately normally. Surprisingly, the variance of this distribution does not van- 
ish even for large cohorts, implying the existence of "hyper-sharing" individuals. The presence of 
such individuals has consequences for the design of sequencing studies, since, if they are selected for 
whole-genome sequencing, a larger fraction of the cohort can be subsequently imputed. We calculate 
the expected gain in power of imputation by IBD, and subsequently, in power to detect an associ- 
ation, when individuals are either randomly selected or specifically chosen to be the hyper-sharing 
individuals. Using our framework, we also compute the variance of an estimator of the population 
size that is based on the mean IBD sharing and the variance in the sharing between inbred siblings. 
Finally, we study IBD sharing in an admixture pulse model, and show that in the Ashkenazi Jewish 
population the admixture fraction is correlated with the cohort-averaged sharing. 



I. INTRODUCTION 

In isolated populations, even purported unrelated in- 
dividuals often share genetic material that is identical- 
by-descent, or IBD. Traditionally, the term IBD sharing 
referred to co-ancestry at a single site (or autozygosity, 
in the case of a diploid individual) and was widely in- 
vestigated as a measure of the degree of inbreeding in a 
population [lj. Recent years have brought dramatic in- 
creases in the quantity and density of available genetic 
data and, together with new computational tools, en- 
abled the detection of IBD sharing of entire genomic seg- 
ments (see, e.g., [2-8]). The availability of IBD detection 
tools that are efficient enough to detect shared segments 
in large cohorts has resulted in numerous applications, 
from demographic inference [9, 10] and characterization 
of populations [11] to selection detection [12], relatedness 
detection and pedigree reconstruction [13-16], prioritiza- 
tion of individuals for sequencing [17], inference of HLA 
type [18], detection of haplotypes associated with a dis- 
ease or a trait [19-21], imputation [22], and phasing [23]. 

Recently, some of us used coalescent theory to calcu- 
late several theoretical quantities of IBD sharing under a 
number of demographic histories. Then, shared segments 
were detected in real populations, and their demographic 
histories were inferred [10]. Here, we expand upon [10] 
to investigate additional aspects of the stochastic varia- 
tion in IBD sharing. Specifically, we provide a precise 



calculation for the variance of the total sharing in the 
Wright-Fisher model, either between a random pair of 
individuals or between one individual and all others in 
the cohort. 

Understanding the variation in IBD sharing is an im- 
portant theoretical characterization of the Wright-Fisher 
model, and additionally, it has several practical applica- 
tions. For example, it can be used to calculate the vari- 
ance of an estimator of the population size that is based 
on the sharing between random pairs. In a different do- 
main, the variance in IBD sharing is needed to accurately 
assess strategies for sequencing studies, specifically, in 
prioritization of individuals to be sequenced. This is 
because imputation strategies use IBD sharing between 
sequenced individuals and genotyped, not-sequenced in- 
dividuals to increase the number of effective sequences 
analyzed in the association study [17, 22, 23]. 

In the remainder of the paper, we first review the 
derivation of the mean fraction of the genome shared be- 
tween two individuals [10]. We then calculate the vari- 
ance of this quantity using coalescent theory with re- 
combination. We provide a number of approximations, 
one of which results in a surprisingly simple expression, 
which is then generalized to a variable population size 
and to the sharing of segments in a length range. We 
also numerically investigate the pairwise sharing distri- 
bution and provide an approximate fit. We then turn to 
the average total sharing between each individual and the 
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entire cohort. We show that this quantity, which we term 
the cohort-averaged sharing, is approximately normally 
distributed, but is much wider than naively expected, 
implying the existence of hyper-sharing individuals. We 
consider several applications: the number of individuals 
needed to be sequenced to achieve a certain imputation 
power and the implications to disease mapping, inference 
of the population size based on the total sharing, and 
the variance of the sharing between siblings. We finally 
calculate the mean and the variance of the sharing in an 
admixture pulse model and show numerically that admix- 
ture results in a broader than expected cohort-averaged 
sharing. Therefore, large variance of the cohort-averaged 
sharing can indicate admixture. In the Ashkenazi Jew- 
ish population, we show that the cohort-averaged sharing 
is strongly anti-correlated with the fraction of European 
ancestry. 



II. RESULTS 

A. Variation in IBD sharing in the Wright-Fisher 
model 

1. Definitions 

The Wright-Fisher model. — We consider the standard 
Wright-Fisher model for a finite, isolated population, de- 
scribed by 2N haploid chromosomes, where each pair 
of chromosomes corresponds to one diploid individual. 
Each chromosome in the current generation descends, 
with equal probability, from one of the chromosomes 
in the previous generation, and recombination occurs at 
rate 0.01/cM per generation. The Wright-Fisher model 
has been widely investigated both in forward dynamics 
and under the coalescent [24] . For simplicity of notation, 
we denote the number of individuals, or the population 
size, as N, even though we really refer to the number of 
haploids and not the number of individuals. Throughout 
most of the analysis, we assume that each individual car- 
ries a single chromosome of length L centiMorgans (cM). 

IBD sharing. — We say that a genomic segment is 
shared, or is IBD, between two individuals if it is longer 
than m(cM) and it has been inherited without recombi- 
nation from a single common ancestor. We do not require 
the shared segments to be completely identical. That is, 
if any mutation has occurred since the time of the most 
recent common ancestor (MRCA), that would not dis- 
qualify the segments from being shared IBD according 
to our definition. The reason is that even in the presence 
of mutations, an order of magnitude calculation shows 
that regardless of the segment length, two individuals 
sharing a segment are expected to differ in just « 1 site 
along the segment (see the Supplementary Material, Sec- 
tion Sl.l). Therefore, in a long IBD segment, the number 
of differences should be very small compared to the num- 
ber of matches. In practice, there are also other sources 
of error in IBD detection, most notably phase switch 



errors. We assume, however, that there always exists 
a large enough length threshold above which segments 
are detectable without errors [3, 7], which corresponds to 
the parameter m introduced above; the precise value of 
the threshold will depend on the genotyping/sequencing 
technology. We assume that information is available for 
M markers, uniformly distributed (in genetic distance) 
along the chromosome, and densely enough that any ef- 
fect caused by the discreteness of the markers is negligi- 
ble (say, if m ■ (M / L) 3> 1). We define the total sharing 
between two individuals as the fraction of their markers 
that are found in shared segments. 



2. Mean total sharing 

In this subsection, we review the derivation of the mean 
fraction of the genome found in segments shared between 
two individuals [10]. We assume that the coalescent pro- 
cess along the chromosome can be approximated by the 
Sequentially Markov Coalescent [25], and ignore the dif- 
ferent behavior of sites at the ends of the chromosome. 
Consider first a single site s and assume that its MRCA 
dates g generations ago. The total length i of the seg- 
ment in which the site is found is the sum of £r and 
£l, where tf> and 1^ are the segment lengths to the right 
and left of s, respectively (all lengths are in cM). The dis- 
tributions of £r and £l are exponential with rate g/50, 
since the two individuals were separated by 2g meioses, 
each of which introduces a recombination event with rate 
0.01/cM, and the nearest recombination would terminate 
the shared segment. The probability n of the total seg- 
ment length, £, to exceed m is, given g, 
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According to coalescent theory in the Wright-Fisher 
model, under the continuous-time scaling g — > Nt the 
times to the MRCA are exponentially distributed with 
rate 1: $(£) = e _t . Therefore, 
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The total fraction of the genome found in shared seg- 
ments is 
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where I(s) is the indicator that site s is in a shared seg- 
ment, and the sum is over all sites. The mean fraction of 
the genome shared is 
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where (■) denotes the average over all ancestral processes. 
As expected, for mN — > oo, (/t) — > and for mN — > 0, 
(f T ) ->■ 1- For large TV, we have (/ T ) « 100/(miV). 



3. TTie variance of the total sharing 

We now turn to calculating the variance of the total 
sharing. Using Eq. (3), 
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where 7T 2 (si,s 2 ) is the probability that both markers Si 
and s 2 are on shared segments and n is given by Eq. (2). 
In the rest of the section, we assume that each individual 
carries one chromosome only, for if we have c chromo- 
somes, each of length L il then (if the two individuals are 
not close relatives) 
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and the variance is 
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where /t,(j) is the total sharing in chromosome i, as- 
sumed independent of the other chromosomes. Rewrit- 
ing 7T 2 (si, s 2 ) as ir 2 (k), where k is the number of markers 
separating si and s 2 , we have 
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All is left is to evaluate 7r 2 (fc), for which we provide three 
approximations. The first is presented below and the 
second (which is a variation of the first) is presented in 
Supplementary Section SI. 3. The third approximation, 
which is the most crude but which yields an explicit de- 
pendence on the population parameters, is presented in 
Section II A 4. 

In the first approach, we assume that once the times 
t\, t 2 to the MRCA at the two sites are known, the sites 
are (or are not) in shared segments independently of each 
other, and with probabilities given by Eq. (1). Clearly, 
this assumption is violated when both sites belong to 
the same shared segment, and in Supplementary Section 
SI. 3, we show how this assumption can be avoided (but in 
the cost of significantly complicating the analysis) . Nev- 
ertheless, it gives a good approximation, as we will later 



see (Figure 2). We can therefore use Eq. (1) to write 
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where <&{t\ , t 2 ) is the joint PDF (probability density func- 
tion) of t\ and t 2 and 
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is the Laplace transform of $(£i,t 2 ). We therefore re- 
duced the problem of finding ir 2 (k) into that of finding 
$(91,92). 

To find <&(£i , £2 ) (or rather, its Laplace transform), we 
use the continuous-time Markov chain representation of 
the coalescent with recombination [24, 26, 27] . The chain 
is illustrated in Figure 1. Initially (present time), the 
chain is in state 1, corresponding to two chromosomes 
carrying two sites each. The chain terminates at state 8, 
when both sites have reached their MRCA. To construct 
the chain, coalescence events were assumed to occur at 
rate 1 and recombination events at rate p/2, where p = 
2N-^jL/l00 is the scaled recombination rate [24]. 

Denote by Pi(t) the probability that the chain is at 
state i at time t, given that it started at state 1. The 
probability that the two sites have reached their MRCA 
simultaneously in the time range [t,t + dt] is P\(t)dt, 
since this is the product of the probability that the chain 
is at state 1 at time t (Pi(t)) and the probability of the 
transition 1 — > 8 in the given time interval (dt). The 
probability that only the left site has reached its MRCA 
(and the right site has not) in [t, t+dt] is P 2 (t)dt+ P 3 (i)dt: 
this corresponds to the transitions 2 — 5 and 3 — s> 7. This 
is also the probability that only the right site has reached 
its MRCA in [t, t+dt] (transitions 2 ->■ 4, 3 ->■ 6). Finally, 
the probability that the left site has reached its MRCA in 
[ti, t\ +dti] and that the right site has reached its MRCA 
in [t 2 ,t 2 +dt 2 ] (t 2 > tt) is [P 2 (fi)+-P3(*i)]<ftie -( * 2-tl) di2- 
This is true, because the exit rate from state 5 and 7 is 
1; therefore, the probability that the chain will wait at 
one of those states for time (t 2 — ti) and then leave to 
the terminal state is e~( t2_tl )dt 2 . Similar considerations 
apply for the case t 1 > t 2 (with the transitions 2 — )■ 4 
and 3 — > 6). In sum, for £1, £2 > 0, 

Hh,t 2 ) = Pi(h)5(t 2 - h) 

+ [P2(h)+P? > {ti)]e- (t ^ ) Q(t 2 -t l ) 

+ [P 2 (t 2 ) + J p 3 (t 2 )]e- (tl - t2) e(ti - t 2 ), 

where 5(t) is the Dirac delta function and 0(t) = 1 for 
t > and is otherwise zero. Laplace transforming the 
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tion rate from i to j 7^ i and Qi 



FIG. 1: An illustration of the continuous-time Markov 
chain representation of the coalescent with recombi- 
nation [24, 27]. Circles correspond to states, with the state 
number in a box on top of each circle. Arrows connecting 
circles represent transitions (solid lines: coalescence events; 
dashed lines: recombination events), with their rates indi- 
cated. The lines inside each circle represent chromosomes 
with two sites each. Ancestral sites are indicated as either 
small circles (as long as there are still two linages carrying 
the ancestral material) or crosses (whenever the two linages 
coalesced and the site has reached its MRCA). Transitions 
leading to the MRCA in one or two sites are colored brown. 
Transitions between states 4 and 6 and between 5 and 7 are 
not indicated, as they do not affect the final coalescence times. 
The schematic was adapted from [24]. 
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In the last equation, P^q) = / °° er qt Pi(t)dt (i = 1, 2, 3) 
are the Laplace transforms of P%(t). The Laplace trans- 
forms can be calculated using the general relation [28] 

Pi(q) = (ql-Q)u\ (9) 
where Q is the transition rate matrix: Qij is the transi- 
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Using Eqs. (8), (9), and (10), and Mathematica, 
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where A = (1 + gi)(l + q 2 ), B = (3 + qi + q 2 )(6 + qi + 
q 2 ), C = p(2 + Ql + q 2 ), D = 13 + 3( 9l + q 2 ) and E = 
p 2 (2 + qi + q 2 ). Eq. (11) was also derived in [29], using 
the birth-and-death-process equivalent of the coalescent 
with recombination, and can also be derived using the 
Feynman-Kac formula (see Supplementary Section SI. 4). 
Substituting, using Mathematica, Eq. (11) in (7) and 
then using Eq. (6) gives an expression for the variance, 



Var[/ T ] = J(N,m,L,M). 
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The function 'J is too long to reproduce here, but can be 
found in the Supplementary Matlab code (File S2). For 
further discussion on the approximations made, see Sup- 
plementary Section SI. 2. The standard deviation (SD) of 
the total sharing is defined as usual as af T = wVax [fr] ■ 
To evaluate the accuracy of our expressions for the 
mean and SD of the total sharing, we used the Genome 
coalescent simulator [30], along with an add-on that re- 
turns, for each generated genealogy, the locations of the 
segments that are IBD between each pair of individu- 
als [10]. The simulation results (see also Methods) are 
presented and compared to the theory in Figure 2. In 
each panel, we varied one of N, m, and L, keeping the 
two others fixed (as long as the marker density is large 
enough, the number of markers M has no effect on the 
variance). Across most of the parameter space, our ex- 
pressions agree well with simulations. Notable devia- 
tions, however, arise for the SD in particularly short or 
long chromosomes. For these cases, the second, more 
complicated approximation, which we mentioned above 
and appears in Supplementary Section SI. 3, is more ac- 
curate (Figure 2). 



4- An approximate explicit expression 

In this subsection, we derive another, simpler approxi- 
mation of the variance, one that is less accurate but that 
has an explicit dependence on the population and genetic 
parameters. The gist of this approximation is that the 
main contribution to the variance comes from the long- 
distance probability of pairs of sites to reside on the same 
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FIG. 2: The mean and standard deviation of the total 
sharing. For each parameter set, we used the Genome co- 
alescent simulator to generate a number of genealogies (from 
a population of size N and for one chromosome of size L), 
and then calculated the lengths of IBD shared segments be- 
tween random individuals. Each panel presents the results for 
the mean and standard deviation (SD) of the total sharing, 
that is, for each pair, the total fraction (in percentages) of 
the genome that is found in shared segments of length > m. 
Simulation results are represented by symbols, and theoreti- 
cal results by lines (Eq. (4) for the mean and Eq. (12) for 
the SD are plotted in solid lines; the approximate form for the 
SD, Eq. (15), is shown in dashed lines). A We fixed m = lcM 
and L = 278cM (the size of the human chromosome 1 [31]), 
and varied N. B Same as A, but with fixed N = 10000 and 
varying m. C Fixed N and m and varying chromosome length 
L. In this panel, we also plotted the result of an alternative, 
more elaborate calculation of the variance (dotted line; see 
Supplementary Section SI. 3). 



segment. Denote the distance between two given sites by 
d, and assume that d > m. For a given pair of individu- 
als, if there was no recombination event between the two 
sites in the history of the two linages, then both sites lie 
on a shared segment of length > d > m. Of course, even 
if there was a recombination event, the two sites could 
still be each on a different shared segment. However, this 
occurs with probability very close to 7r 2 , the probability 
that the two sites are on shared segments given that they 
are independent. 

In terms of Eq. (6), the above approximation trans- 
lates to, for d > m, 



7r 2 (fc)-7r wp nr , 
where p nT is the probability of no recombination, 

1 1 50 
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for distant sites where Nd 3> 50. This is true because in 
the ancestral process (Figure 1), no recombination cor- 
responds to a coalescence event taking place before any 
recombination event. Since the coalescence rate is 1 and 
the recombination rate is p, Eq. (14) follows. We can 
then further simplify and neglect the contribution to the 
variance from sites separated by short distance d < m. 
Finally, we can also neglect the single-site term of the 
variance, since it scales as l/M and therefore vanishes 
when the markers are dense. Overall, the simplihed Eq. 
(6) gives 
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for L 3> m. Nicely, Eq. (15) provides an explicit (and 
rather simple) dependence on N, L and m, and as ex- 
pected, the expression does not depend on the marker 
density. Eq. (15) is also plotted in Figure 2, showing 
that it fits quite well to the simulation results, although 
it is usually less accurate than Eq. (12). 

For the entire (autosomal) human genome, we use Eq. 
(5), 



Var [fr 



lOOES^m^) 



N 



For m w l(cM), the last equation gives 
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A variable population size. — The framework presented 
above can be extended to calculate the variance for a 
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generalization of the Wright-Fisher model in which the 
population size is allowed to change in time. Denote the 
population size as N(t) = N a \(t), where t is the time 
(scaled by No) before present. The PDF of the (scaled) 
coalescence time for two linages is (see, e.g., [32]) 
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As shown in [10] , the mean of the total sharing is obtained 
by substituting the above $(t) in Eq. (2), giving 
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For the variance, following Eq. (13), we need to calcu- 
late the probability of no recombination, p nl . For sites 
distance d apart, 
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since for coalescence time t the sites are separated by 
2Not meioses, in each of which the probability of no re- 
combination is e ~ d / lm . Eq. (18) reduces to Eq. (14) 
for \{t) = 1 (where = e"<). Eq. (18) can then be 
substituted into Eq. (15), giving 
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In Supplementary Section SI. 5, we work out an exam- 
ple of a linearly expanding population, where Eq. (18) 
was solvable and the integral of Eq. (19) was evaluated 
numerically. 

The total sharing in a length range. — Consider the 
quantity /t;£i,£ 2 , defined as the total fraction of the 
genome found in shared segments of length in the range 
[4,4]- Clearly, fr-^M = fT;m=e 1 - fT-m=e 2 , that is, the 
difference between the usual total sharing when m = 4 
and when m = £ 2 . The average is simply (/t;£i ,i 2 ) = 
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The covariance term can be expanded as 
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where I(s; m = t) is the indicator that site s is in a 
shared segment of length at least I, n m= £ is the proba- 
bility associated with the indicator, and ^(si, £±; S2, (-2) 
is the probability that site s\ is in a shared segment of 
length at least l\ and site S2 is in a shared segment of 
length at least £ 2 . The key approximation, similar to 
the one made in subsection II A 4 (Eq. (15)), is that 
T2(si, t\ \ S2, (2) — 7T m= 47r TO= ^ 2 is non-zero only when the 
two sites lie on the same segment and the segment is 
longer than £ 2 - Defining p nr , as before, as the probabil- 
ity of no recombination between si and S2 in the history 
of the two individuals, we have 
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where the last step follows from Eq. (15). Substituting 
Eq. (21) into Eq. (20), we obtain 
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For a constant population size, using Eq. (15) (taking all 
terms in that equation) gives 
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equation that was derived in [10] and then used for de- 
mographic inference. Here, we calculate the variance, 
Var [/t ; 4/ 2 ], as follows, 

Var [/t;4/ 2 ] = Var [/ T; m=^ - h-m=i 2 \ (20) 
= Var [f T . m=ll ] + Var [/ T; m=f 2 ] - 2Cov [f T ; m =t 1 , h;m=e 2 ]- 



Eq. (22) is compared to simulations in Figure 3, showing 
good agreement. Note that as long as 4,4 <S L, the 
variance depends only on the ratio 4/4- 



5. The total sharing distribution and an error model 

Having the first two moments of the total sharing, we 
sought to find its distribution, P(fr)- While we could not 
find an exact expression, we could find, inspired by the 
numerical results of [13], a reasonable fit. Huff et al. [13] 
showed empirically that for HapMap's Europeans [31], 
the number of segments shared between random individ- 
uals was distributed as a Poisson, and that the length of 
each segment was distributed exponentially with a lower 
cutoff at m, independently of the number of segments. If 
this is true also for the Wright-Fisher model, then the to- 
tal length of the shared segments, defined as Lt — Lfr, 
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FIG. 3: The standard deviation (SD) of the total shar- 
ing in a length range. Simulation results (symbols) are 
shown for the SD of the fraction of the genome found in 
shared segments of specific length ranges. The total sharing 
for each range was calculated for random pairs of individuals 
in Wright-Fisher populations of the sizes indicated in the leg- 
end. The SD is plotted vs. the starting point of each length 
range, i\ (where for each £i, the successive data point is £2). 
Note the logarithmic scale in the x-axis and hence that £2/(^1 
is fixed (equal to 1.5). Theory (lines) corresponds to Eq. (22). 



is distributed as a sum of a Poisson distributed number 
of these exponentials. In equations, 
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P(Lr) = X> 



n=0 



n! 



Prob{4+4+-+4 = L T }, (23) 



where Tin is the mean number of segments, the density 
of the £iS is exp[— (£j — m)/£ Q ]/£ (£ + m is the mean 
segment length), and £i > m. Such an expression is easier 
to handle in Laplace space, where the Laplace transform 
of P(L T ), P(s), is 



n ^0 



n=0 

exp 



n! [s£ + l] n 



-n 1 



s£ + l 



(24) 



and we used the convolution theorem. For given no and 
£q, P{Lt) (and then P{fx)) was uniquely determined 
from P(s) by numerical inversion [33, 34]. For specific 
values of (./V, L, m), we fitted the parameters no and £q 
by minimizing the squared error between the simulated 
distribution and P(fx) (from Eq. (24)) in a grid search. 
The results are plotted in Figure 4, with the fitted no 
and £q plotted in Figure S2. It can be seen that Eq. (24) 
captures quite well the unique features of P(fx) (except 
in the tail; see Figure S2). 

Inspection of the distributions (Figure 4) for several 
values of TV lead to some interesting observations. For 
small TV (e.g., TV w 1000, and for m = lcM and 



FIG. 4: The distribution of the total sharing. Simula- 
tion results (symbols) are shown for the distribution of the to- 
tal sharing between random pairs of individuals in the Wright- 
Fisher model. Details of the simulation method are as in Fig- 
ure 2A. A The distribution of the total sharing for N = 1000, 
3000, and 5000. For better readability, the x-axis (the total 
sharing fx) is given in percentages and scaled by A^/IOOO, 
shifting the distributions for N = 3000 and N = 5000 to the 
right. B The distribution of the total sharing for A^ = 8000 
and 16000. Here the x-axis is not scaled. In both panels, 
lines represent the fit to a sum of a Poisson number of shifted 
exponentials, Eq. (24). 



L = 278cM), where the typical amount of sharing is 
large ((/ T ) w 5 - 10%, n « 10, £ Q « lcM), the dis- 
tribution is unimodal (but not normal), centered around 
(fr)- As N increases (e.g., N « 3000), a discontin- 
uous peak appears at fr = 0, with P(/t) = for 
< fx < m/L(zx 0.4%). This is of course due to the 
restriction on the minimal segment length: a pair of in- 
dividuals can share either nothing or at least one segment 
of length m. For fx > m/L the distribution is continu- 
ous, still centered around (/t), but with small, yet no- 
table peaks at fx = m/L, 2m/L, 3m/L, ... corresponding 
to pairs of individuals sharing a small number of minimal 
length segments. For even larger TV (e.g., TV w 10000 and 
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beyond), (fr) drops below 1%, % « 1 (£o still around 
lcM), and the peaks at fx = and fx = m/L increase 
such that the distribution decreases almost monotoni- 
cally beyond m/L. An analytical bound on the fraction 
of pairs not sharing any segment is given in Supplemen- 
tary Section S2.1, Eq. S33. 

An error model. — To model errors during IBD detec- 
tion, suppose that we set m large enough as to avoid 
any false positives (i.e., detected segments that are not 
truly IBD). We model false negatives as true IBD seg- 
ments being missed with probability e (independent of 
the segment length). It is possible to extend the above 
formulation (Eq. (23)) to the case with errors, as fol- 
lows. Summing over the true number of segments, n' , 
the distribution of the number of detected segments, n, 
is 

n'=n 

n\ 

that is, a Poisson with parameter no(l — e). Then, as 
a sum of a random number of independent variables, 
the mean and variance of Lt are (Lt) = (n) (£) and 
Var [Lt] = (n) Var [£] + (£) Var [n] , where n is the num- 
ber of segments and I is the segment length. In our case, 



m=1(cM), L=278(cM), N=10000 



(Lt) = (l-e)n (£ +m), 

Var [Lt] = (l-e)n [lt + (l +mf] 



(25) 



demonstrating that in the presence of detection errors, 
both the mean and the variance of the total sharing are 
(1 — e) times their noise-free values. This is confirmed by 
simulations in Figure 5. 

Other approaches. — We note that a similar approach 
dates back to R. A. Fisher [35] and others [36-38] in 
their work on IBD sharing in a model where the popula- 
tion has been recently founded by a number of unrelated 
individuals. Briefly, those authors too assumed a Poisson 
number of IBD segments, each of which is exponentially 
distributed. They then matched the Poisson and expo- 
nential parameters to the average IBD sharing and the 
average number of segments, which they calculated us- 
ing their population model. Here, we used a different 
population model (the coalescent; see also Supplemen- 
tary Section S2.2) and assumed the exponentials have a 
cutoff at m. In principle, the parameters no and to can 
also be directly calculated, by matching the mean and 
variance of the total sharing; see Supplementary Section 
S2.3. In practice, however, this does not give a good 
fit. In [10], a similar compound Poisson approach was 
developed but with a different, coalescent theory-based 
approximation of the segment length PDF, leading to an 
improved fit of the remaining parameter no- 
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FIG. 5: The mean and standard deviation (SD) of the 
total sharing in the presence of detection errors. Sim- 
ulation results (symbols) are plotted for mean and SD of the 
total sharing in the Wright-Fisher model. Simulation details 
are as in Figure 2, except that each segment was dropped with 
probability e. Theory (lines) is from Eq. (4) for the mean and 
(12) for the SD, but where the mean is multiplied by (1 — e) 
and the SD by >/l — e, as in Eq. (25). 



B. The cohort-averaged sharing 



We have so far considered the total sharing between 
any two random individuals in a population. In prac- 
tice, we usually collect genetic information on a cohort 
of n individuals. In this context, we can attribute each 
individual with the amount of genetic material it shares 
with the rest of the cohort. Define, for each individ- 
ual, the cohort-averaged sharing Jt, as the average total 
sharing between the given individual and the other n — 1 
individuals in the cohort. Naively, one may anticipate 
that the width of the distribution of fx will approach 
zero for large n, because the averaging will eliminate 
any randomly arising differences between the individu- 
als. We show that in fact, the width of the distribu- 
tion approaches a non-zero limit. The individuals at the 
right tail of the cohort-averaged sharing distribution can 
be seen as "hyper-sharing", meaning they are, on aver- 
age, more genetically similar to members of the cohort 
than are others. Similarly, individuals at the left tail are 
"hypo-sharing" . The existence of hyper-sharing individ- 
uals is important for prioritizing individuals for sequenc- 
ing, as we will show in Section II C. 

Define the fraction of the genome shared by individuals 



i and j as /j, . 



The cohort-averaged sharing of i, /t 



(<) 



E 



f(i,3) 
Jt ■ 



9 



The variance of fa % is 



i n 



+ 



n-2 



n — 1 n 
+ Cov 



JT 

n 



^Cov [/™ /^J 

,(1,2) ,(1,3)1 
JT i JT ' 



(26) 



where we assumed n 3> 1 and used the fact that the 
covariance term is identical for all (i, ji, J2) combinations 
and therefore, for simplicity of notation, we set i = 1, 

h = 2, and n = 3. Recall that = £ ^=1 1 ( s ) 

(Eq. (3)), where I(s) is the indicator that site s is on a 
shared segment. Thus, the covariance can be written as 

Gov = 

1 M M 



= 1 S 2 = l 



M 2 



fc=l 



where I^>(s) is the indicator that site s is on a segment 

shared between individuals i and j, and n^ 2 ' 1 ' 3 ^ (k) is the 
probability that a given site is on a segment shared be- 
tween 1 and 2 and that another site, k markers away from 
the first, is on a segment shared between 1 and 3. As in 
Section IIA4 (e.g., Eq. (15)), we will keep only the most 
dominant term in the sum. Consider the coalescent tree 
relating the three individuals 1, 2, and 3, and assume 
that the distance between the sites is d > m. If there 
was no recombination event in the entire tree between 
the two sites, then immediately ^2 (A:) = 1. Other- 
wise, we assume that each of the two sites belongs to a 
shared segment (or not) independently of the other, that 
(k) w 7T 2 . The probability of no recombina- 



(1,2;1,3) 



tion, p nl , depends on T 3 , the total size of the tree of three 
linages. Since the PDF of T 3 is P(T 3 ) = e" T3 / 2 - e~ T3 
[24, 39], 



P„ r = / P(T 3 
Jo 



e -dNT 3 /100 dT = 



5000 

(50 + dN){100 + dN)' 



or, for dN > 100, 



5000 

{dN) 2 



The covariance becomes 



cov [fr\sr ] ] 
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(27) 



Substituting Eqs. (15) and (27) in Eq. (26) , the stan- 
dard deviation of the cohort-averaged sharing is 



n 10000 



1 + 



N 2 mL 



10 



\ 



In(-) 
nNL 



1 + 
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A^mln I 



(28) 



while for 



For (2<)n <C TVmln(^) /100, a-^ 
n > Nm\n (£) /100 (but < TV, as the size of the cohort 
size cannot exceed the population size), a-^ w 
which is independent of n. This implies that even for very 
large cohorts, the distribution of the cohort-averaged 
sharing retains a minimal width. Eq. (28) is in good 
agreement with simulations, as shown in Figure 6A (al- 
though some deviations are seen for larger n). We note 
that the variance was computed for a given individual 
over all ancestral processes of a cohort of size n. There- 
fore, the variance within the cohort, for a given ancestral 
process might actually be smaller. Simulations results 
(Figure S4), however, show that unless n is very small, 
Eq. (28) is a good approximation also for the variance 
within the cohort. 

For a genome with c chromosomes, 



Var [iV] 



E° = i^Var[/T, W ] 

(E- =1 ^) 2 ' 



where /t,(j) is the cohort-averaged sharing of chromo- 
some i. For the human genome and for small n and 
m « lcM, Eq. (16) gives 



0.382 



whereas for large n, Eq. (27) gives 

1.68 



(29) 



(30) 



which is, as explained above, independent of n. 

The fact that the width of the cohort-averaged shar- 
ing distribution does not approach zero for large n re- 
sults from the "long-range" correlations between the av- 
eraged (n — 1) variables, or in other words, the fact that 

Cov [Z^ ,/^] > for all i,j u j 2 - In [40], it was 
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FIG. 6: The cohort-averaged sharing. A Simulation 
results (symbols) for cr-^r, that is, the standard deviation 
(SD) of the cohort- averaged sharing (in percentage of the 
chromosome) vs. the cohort size n. The different curves 
correspond to different values of N (top to bottom: N = 
1000,2000,4000,8000,16000). The lines correspond to Eq. 
(28). Details of the simulations are as in Figure 2A. B The 
distribution of the cohort-averaged sharing. The fit is to a 
normal distribution having the same mean and SD as the real 
data. Also plotted is a normal distribution with mean given 
by Eq. (4) and SD given by Eq. (28). 



small probability (« 1 /N) of random pairs of individuals 
to be close relatives. 



C. Implications to sequencing study design 

Suppose we have sparse genotype information for a 
large cohort, as well as whole-genome sequences for a 
subset of it. If the genotype data allow detection of IBD 
shared segments, then alleles not typed can be directly 
imputed if they lie on haplotypes shared with sequenced 
individuals (see, e.g., [22]). In fact, given abundant IBD 
sharing, such a strategy is expected to be quite success- 
ful; as we mentioned in Section II A 1, only about one re- 
cent mutation is expected on each shared segment. Since 
some individuals share more than others, their sequenc- 
ing should be prioritized if imputation power is to be 
maximized. Recently, Gusev et al. [17] developed an 
algorithm (Infostip) for sample selection based on the 
observed IBD sharing. Here, using our results in Sec- 
tion II B, we calculate the theoretical maximal imputa- 
tion power. 

Assume first that individuals are haploids; the case 
of diploids is treated later. Assume a cohort of size n, 
a budget that enables the sequencing of n s individuals, 
and two selection strategies: either of random n s indi- 
viduals or of the n s individuals with the largest cohort- 
averaged sharing. Define an imputation metric, Pc , as 
the average fraction of the genome of i, an individual 
not sequenced, that is shared IBD with at least one 
sequenced individual. Let the selected individuals be 
mi, m>2, fn na , and denote the fraction of the genome 
that i shares with rrij as f^ m '\ To calculate pi\ we 
assume that for all j 1 ,j 2 = 1— ,n s (ji ^ j 2 ), /^' mjl) is 

independent of f^' mj2 ^ (which is justified, as we showed 
in Section II B). We also assume that the locations of 
the shared segments are independent and uniformly dis- 
tributed along the genome. Under these assumptions, 
the probability of a locus to be covered by at least one 
sequenced individual is 



found that when all pairs of random variables are weakly 
correlated, the PDF of their average converges to a nor- 
mal distribution. In our case, the covariance is « ™° 
(Eq. (27)), much smaller, for typical values of N, L, and 
m, than aj T w In (i) (Eq. (15)). The variables we 
average are therefore weakly dependent, as we also ob- 
serve in simulations (Figure S5). We thus conjectured 
that the distribution of the cohort-averaged sharing is 
normal or is close to one. This is confirmed by simula- 
tion results, as shown in Figure 6B. We note, however, 
that a small but systematic deviation from a normal dis- 
tribution exists in all parameter configurations we tested, 
in the form of a broader right tail and a narrower left tail 
than expected (Figure S5). This seems to be due to the 



n 



It 



(31) 



and this is also the average covered fraction of the 
genome. We note, however, that assuming that the loca- 
tions of shared segments are independent and uniformly 
distributed is mostly for mathematical convenience. Sim- 
ulation results (Figure S6) show that sharing tends to 
concentrate at specific loci, implying that Eq. (31) can 
be thought of as an upper bound (see Figure 7). When 
h « 1, 



1 — exp 



E 



f (i,m } ) 
JT 
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and for random selection of individuals for sequencing, 



P, 



(rand) 



1 - exp(-n s (f T )) ■ 



(32) 



where (fx) is given by Eq. (4). When selecting the 

highest sharing individuals, values of j^' m ' J ^ come from 
the right tail of the cohort-averaged sharing distribution, 

p(7t), 



SfJ T PUT)df T 



(high) 

T 



where f c and (^f^^J are the minimum and average, 

respectively, of the cohort-averaged sharing among the 
sequenced individuals (jy P(/r)d/T — n s /n). Since we 

argued in Section II B that P{fr) is approximately nor- 
mal with parameters (/t) and a-y (Eq. (28)), f c satisfies 



erfc 



fc - (fr) 



. .IT 

We can thus finally write 

p (high) w 1 _ exp 



= 2n s /r 



(high) ^ 



(33) 
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FIG. 7: Coverage of genomes not selected for se- 
quencing by IBD shared segments. We simulated 500 
Wright-Fisher populations with N = 10000, n = 100, and 
L = 278cM, and searched for IBD segments with length 
> m , = lcM. For each plotted data point, we selected n 3 indi- 
viduals either randomly or using Infostip. Then, for each of 
the n — n s individuals not selected, we calculated the fraction 
of their genomes shared with at least one selected individual. 
We plot (symbols) the average coverage over all individuals 
in all populations. Lines correspond to theory: Eq. (32) for 
random selection and Eq. (34) for Infostip selection. 



Before getting to simulations, we notice that in prac- 
tice, selection of exactly those individuals with the largest 
cohort-averaged sharing will not achieve the imputation 
power of Eq. (34). This is because the top sharing in- 
dividuals usually share many segments with each other 
and thus sequencing of all of them will be redundant 
(e.g., in the extreme case of siblings, both will appear as 
top sharing, but sequencing of both will add little power 
beyond sequencing just one). To avoid such redundan- 
cies, we selected the high-sharing (simulated) individuals 
using Infostip [17], which proceeds in a greedy man- 
ner, each time selecting the individual who shares the 
most with the rest of the cohort in regions that are not 
yet covered by the already selected individuals. We then 
compared the imputation power when individuals were 
selected either randomly or using Infostip. The results, 
shown in Figure 7, suggest good agreement between the 
theoretical Eqs. (32) and (34) and the simulations, at 
least as long as n s is not large (relative to n). For large 
n s , the coverage is lower than predicted, likely due to the 
non-uniform concentration of the shared segments. 

For a cohort of n diploid individuals (assuming phase 
can be resolved) we redefine the cohort-averaged sharing 
as 

- f « _ 1 f-r^- 1 ) , nri^A 

/T,dip = 2 yJT + JT j , 
n i) 

(where, e.g., /t ' is the cohort-averaged sharing of the 
first chromosome of individual i) and assume that the in- 
dividuals selected for sequencing have the largest diploid 



cohort-averaged sharing. Since the two terms in /r.dip 
are weakly dependent, 

<j- t (ri) w — =<r3-(2n), 

where crj^ is given by Eq. (28). The coverage metric p c 
is interpreted, as before, as the probability of a locus on 
a given chromosome to be in a segment shared with at 
least one sequenced chromosome. The theory developed 
above is still valid, provided that in Eqs. (32) and (34) 
n s is replaced by 2n s and that in Eq. (33) a-^ is replaced 
by a-f . 

J JT. dip 

Increase in association power. — Using our results for 
the power of imputation by IBD, we calculate below the 
expected subsequent increase in power to detect rare vari- 
ant association. We use the simple model of [41], in which 
we consider rare variants that appear in cases but not in 
any control, and assume that the causal variant is domi- 
nant. 

Assume that we have genotyped and detected IBD 
segments in a cohort of n c (diploid) cases and nt con- 
trols, and that we sequenced a subset of n s individu- 
als, out of which n c s are cases and nt )S are controls 
(n s = n c>s + n t s ). After imputation by IBD, a locus 
in a (diploid) individual not sequenced has probability 
p\ to be successfully imputed, where p c is given by Eq 
(32) or Eq. (34). For a given locus, we define the effective 
number of cases (controls), as the number of cases (con- 
trols) for which genotypes are known either directly from 
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sequencing or from imputation. Since there are n c — n c>s 
cases not sequenced and n% — Tits controls not sequenced, 



4 eff) 



n c , s + (n c - n c ^ s )pl(n c ,n CtS ), 
n t , s + {n t ~ n t)S )p 2 c {n t , n t>s ). 



(35) 



In the last equation we assumed, without loss of gener- 
ality, that cases can only be imputed using other cases, 
and vice versa. The probability of a variant to appear in 
exactly b cases but in no controls, under the null hypoth- 
esis that the variant assorts independently of the disease, 
is given by Fisher's exact test, 



P( cases only) = 



,(eff) 



(eff) (eff) 

rtc +n t 



Define Q as the threshold P-value and denote by b* the 
smallest integer above which P(cases only) < Q. When 
the causal variant carrier frequency in cases is ft, the 
probability of the variant to appear in b cases is binomial, 
and the power is, for a given Q, 



6=6* 



(36) 



In Figure S7, we plot the power vs. n cs , when the 
sequencing budget n s = n c s + n t s is fixed and for rep- 
resentative parameter values. In Figure 8, we plot the 
power vs. the carrier frequency for the optimal value of 
n c s . The figure demonstrates that the power increases 
by several fold when imputation by IBD is used. This is, 
however, an expected consequence of increasing the effec- 
tive sample size, and would likely be achieved with any 
imputation algorithm (e.g., [42]). The figure also shows 
an additional, slight increase when the highest sharing 
individuals are selected for sequencing. Thus, while it 
should be easy to identify the highest sharing individuals 
given a genotyped cohort (e.g., using Infostip [17]), and 
doing so will increase the association power, our results 
suggest that the gain in power over a random selection 
will be minor. 



D. Other applications of the variance of IBD 
sharing 

1. An estimator of the population size 

Assume that we have genotyped or sequenced a diploid 
chromosome of one individual and calculated fx, the frac- 
tion of the chromosome shared between the individual's 
paternal and maternal chromosomes. Can we estimate 
the effective population size? 

According to Eq. (4), (f T ) = ^"ff+ffi ■ Solving for 
N gives (see also [10]), 
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FIG. 8: Power to detect an association after imputa- 
tion by IBD. The figure shows the maximal power to detect 
an association, with and without imputation by IBD, and 
with sequenced individuals selected either randomly or ac- 
cording to their total sharing. The parameters we used were: 
N = 10000, L = 278cM (one chromosome), m = lcM, cohort 
size of 500 cases and 500 controls, a total sequencing budget 
of n s = 100 individuals, and a threshold P-value of Q = 0.01. 
For each carrier frequency /3, we computed the power for each 
pair of n c , s and nt, s (number of sequenced cases and controls, 
respectively) such that n CjS + nt, s = n 3 , and recorded and 
plotted the maximal power. The power was calculated using 
Eqs. (35) and (36), where in Eq. (35), p c was set to zero for 
the case of no imputation, or calculated using Eqs. (32) and 
(34) (random selection and selection by total sharing, respec- 
tively, and adjusted for diploid individuals). For the studied 
parameter set, imputation by IBD leads to a major increase 
in power. Proper selection of individuals for sequencing also 
contributes to the power but only slightly. 



for (fx) <C 1. This suggests the following estimator, 

100 75 

mfr rn 



N 



(37) 



Below, we investigate the properties of the simple esti- 
mator of Eq. (37). Using Jensen's inequality, it is easy 
to see that the estimator is biased, 



N 



100 

m 



75 



100 



75 
m 



N. 



The variance of TV is proportional to Var [1//t] 5 which 
we could not calculate, but could approximate as follows. 
Let us write N as 



N = 



100 



75 



m[{f T - 
100 

m (/t) 



</t)) + </t>] m 
1 lT-(h) \ 75 
</t> ) m' 



where we applied the Taylor expansion 



l+x 



assuming \fo — (/t) | *C (/t) (in which regime clearly 
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^iVy = N). Since additive constants do not contribute 
to the variance, the standard deviation is 



^ 100a fT ^mN*/> /ln(£) 
^~m</r> 2 ~ 10 V i ' 

where we used (} T ) w 100/(mJV) (Eq. (4)), and 
Eq. (15)) for <tj t . The effective population size can 
also be inferred using Watterson's estimator, which is 
N\y — S2/C211), where S2 is the number of heterozy- 
gous sites and fi is the mutation rate (per chromo- 
some per generation). Watterson's estimator is unbiased, 

N\v ) = N, and has variance (assuming no recombina- 



tion) Var [N w ] = [2^N + (2 f iN) 2 ]/4^ 2 « N 2 . Therefore, 
a N w / N ~ !. compared to cr^/iV « N 1 / 2 for the IBD 
estimator. 

Note that in practice, the proposed estimator is not 
very useful, as it diverges whenever fx = (which is 
common for large N). Suppose, however, that we have 
sequences for n (haploid) chromosomes, and that we have 
computed the total sharing between all pairs. Define 



form 



/(I 



The estimator now takes the 



N 



100 

mf T 



75 

m 



This is again an overestimate, ^ A^y > N. In Supplemen- 
tary Section S3, we show that is approximately 



5\/nL 



In I 



2n 



+ 



100 

JVm' 



(39) 



For comparison, in Watterson's estimator for n (haploid) 
chromosomes, /N « 1/lnn (for large and n), 
which decays to zero with increasing n slower than the 
IBD estimator. Simulation results, shown in Figures S8 
and S9, confirm the accuracy of Eq. (39) and show that 
the bias is limited to very small values of n. 

In the context of the error model of Section II A 5, in- 
troducing a probability e to miss a true IBD segment will 
decrease the average total sharing by (1 — e) (Eq. (25)). 
Consequently, Eq. (38) will estimate a population size 
approximately 1/(1 — e) (« (1 + e) for small e) larger 
than the true one. 



2. IBD sharing between siblings 

The total IBD sharing between relatives can usually be 
decomposed into sharing due to the recent co-ancestry 
and "background" sharing due to population inbreeding 
[13, 14]. While much is known about the distribution 
of sharing in pedigrees (e.g., [43]), less is known about 
the population level sharing, and relatedness detection 
algorithms (e.g., [13, 14]) estimate it empirically. In a 



different domain, the variance in sharing between rela- 
tives appears in theoretical calculations of the variance of 
heritability estimators [44]. Our results for the variance 
of the total sharing in the Wright-Fisher model (Section 
II A) can thus have practical applications if modified to 
account for recent co-ancestry. 

Here, we calculate the variance of the sharing between 
siblings by combining the approach of [44] with that of 
our Section II A 4. Assume that two individuals are sib- 
lings, either half or full — we will calculate, without loss 
of generality, only the sharing between the two chromo- 
somes that descended from the same parent — and de- 
note the fraction of sharing as fs- Assume as before a 
population of size N and one chromosome of length L. 
For a given marker to be on a shared segment, it can ei- 
ther be on a segment directly co-inherited from the same 
grandparent (probability 1/2), or otherwise on a segment 
shared between the grandparents (probability 7r/2, Eq. 
(2)). We ignore boundary effects near the sites of recom- 
bination at the parent. The mean fraction of the genome 
shared is therefore just (fs) = (1 + 7r) /2. The variance 
can be written as in Eq. (6) , 



Var [/s]«^£(M-fc) 



fc=i 



1 2 
7T2,s(fc) - - (1 +7T) 



(38) where n2,s{k) is the probability of two sites separated 
by k markers (or genetic distance o! = fc-g) to be on 



segments shared between the siblings. The probability 
that the two sites are both co-inherited from the same 
grandparent is 

P^ = \[{l-rf+r 2 ]=\{l + e-^), 

where r is the recombination fraction and we used Hal- 
dane's map function [44]. Also with probability p S ame; 
the sites are both inherited from different grandparents, 
and we use the expressions developed in Section II A 4 
for the probability of the sites to be in shared segments: 
7T2,s(fc) = 7r 2 + 0(d — m)p nT (where p nr w 50/ (N d) and 
@(x) = 1 for x > and is zero otherwise). With proba- 
bility (1 — 2p same ), one site is co- inherited and the other 
is not; in that case 7r 2 ,s(A;) = n. Approximating the sum 
as an integral and simplifying, we finally have 



r 1 { 7T (1 - e - xL / 25 ) 

Var [f s ] « 2 dx{l - x) I ^ '- 

? ^ ( m \ 50 

1 + TV 2 + 9 (X - — -^—r 

\ LJ NxL 



+ 



(40) 



1 + e 



-xL/25 



(1+tt) 2 



We solved Eq. (40) using Mathematica and summed 
over all chromosomes as in Eq. (5). The results for the 
mean and standard deviation (SD) of the total sharing 
between siblings are plotted in Figure 9 and compared to 
an outbred population where the grandparents are unre- 
lated. The SD in the outbred population overestimates 
the Wright-Fisher SD, in up to « 18% for N as small as 
500. 
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FIG. 9: IBD sharing between siblings in the Wright- 
Fisher model. We plot the theoretical mean and standard 
deviation (SD) of the IBD sharing between the (maternal 
only or paternal only haploid) genomes of siblings. Lines cor- 
respond to an outbred population (unrelated grandparents): 
the mean sharing is 50% and the SD is taken from [44] . Sym- 
bols correspond to the theory for Wright-Fisher model: the 
mean sharing is (1 + 7r)/2 (where n is given by Eq. (2)), and 
the SD is given by Eq. (40). We used m = lcM and the 
chromosome lengths of the autosomal human genome. Note 
that the y-axis is on the left side for the mean and on the 
right side for the SD. 



3. IBD sharing after an admixture pulse 



In this final subsection, we study the IBD sharing in 
a simple admixture model. In our model, a single popu- 
lation A of constant size N has received gene flow from 
population B, G a generations ago. We assume that gene 
flow took place for one generation only (hence, an admix- 
ture pulse), and further, that population B is sufficiently 
large that the chromosomes it donated to A share no de- 
tectable IBD segments. Denote the fraction of the linages 
coming from population A at the admixture event as a 
(fraction 1 — a coming from B), and let T a = G a /N be 
the scaled admixture time. We are interested in IBD 
sharing between extant chromosomes in population A. 

To approximate the mean IBD sharing in the sample, 
note that if admixture was very recent, then two chro- 
mosomes will be potentially shared only if both descend 
from population A, which occurs with probability a 2 . 
Therefore, the mean sharing is a 2 times its value without 
admixture. While this is a good approximation (Figure 
S10), it does not account for two chromosomes, one or 
two of which are from the external population B, having 
their common ancestor more recently than the admixture 
event. We therefore calculate the mean IBD sharing us- 
ing Eq. (17), using the following (non-normalized) PDF 



for the coalescence times 
*(t) = 



t < T a 
t > T„ 



(41) 



which gives 
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Note that this is just (fx) 



admix 



</r) r 



i adn 



+ (i 



a )T a . The first term corresponds to linages descending 
from population A; the second term corresponds to at 
least one of the linages descending from population B but 
where the linages have coalesced already in the hybrid 
population. The variance can be similarly calculated, by 
substituting Eq. (41) into Eq. (19), 



Var [h 



;/L 



(1-x) 



-t-txNL/50 



dt 



dx 







+ 2a 2 


f 




m/L 


^ ioo 


jln 









-t-txNL/50jj. 



1 + (1-Q 2 ) 



7 — In 



dx 

fmNT a 
V 50 



(43) 



where 7 is the Euler-Mascheroni constant, and we solved 
the integrals in Mathematica and later simplified un- 
der specific assumptions (see Supplementary Section S4). 
Eq. (43) usually predicts a variance slightly smaller than 
the case of no admixture. Simulation results are shown 
in Figure S10 for the mean and variance. While agree- 
ment is not perfect (as Eq. (19) is itself approximate), 
Eqs. (42) and (43) capture the main effects of changing 
a and T a . Note that the result of Eq. (42) implies that, 
for small T a and large TV, the observed mean IBD sharing 
is as if the population is of size w N/a 2 . 

A test for admixture. — . For recent admixture (small 
T a ), the fractions of ancestry vary among individuals 
[45, 46]. In our model, since a pair of segments is shared 
mostly when both descend from population A, some indi- 
viduals will share more than others merely due to having 
a larger fraction of A ancestry. In turn, this will increase 
the variance of the cohort-averaged sharing. This obser- 
vation suggests the following test for a recent gene flow 
into a population, (i) Extract IBD segments and cal- 
culate the mean fraction of total sharing over all pairs, 
fx, as well as the SD of the cohort-averaged sharing, 
o~-fo. (ii) Use Eq. (38) to infer the population size: 

N = 100/(m/V) — 75/m. (in) Simulate iVp 0p popula- 
tions of size TV; extract IBD sharing and calculate the 
SD of the cohort-averaged sharing in each population. 
(iv) The P-value for rejecting the null hypothesis of no 
admixture is the fraction of the -/V pop populations where 
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the SD of the cohort-averaged sharing was larger than 
the observed one. Note that the identity of the external 
population need not be known, nor are the admixture 
fraction and time; the test relies on admixture creating 
a gradient of ancestry fractions and hence an increased 
variability in the similarity between individuals. Simula- 
tion results are plotted in Figure S10, showing that for 
a P-value of 0.05 and G a = 5, gene flow with a « 0.9 or 
lower can be detected (a w 0.8 or lower for G a = 10). 
We stress that a broader than expected distribution of 
cohort-averaged sharing does not necessarily indicate ad- 
mixture, and there might be other factors responsible for 
the effect (see also the Discussion). We validated, how- 
ever, that IBD detection errors alone (as in the model 
of Section II A 5) as well as variable population size (in a 
simple two-size model) do not lead to significant P-values 
in the admixture test (Figure Sll). 

IBD sharing and admixture in the Ashkenazi Jewish 
population. — As our final result, we apply the admix- 
ture test to the real population of Ashkenazi Jews (AJ). 
Historical records, and recently also genetic studies, sug- 
gest that AJ form a genetically distinct group of likely 
Middle-Eastern origin. However, the AJ population was 
also shown to receive a significant amount of gene flow 
from neighboring European populations [47-51]. We an- 
alyzed a dataset of w 2600 AJ, details on which have 
been published elsewhere [10, 51] and are summarized 
in the Methods section. To detect IBD shared segments 
in the AJ population, we used Germline [2]. For 500 
individuals on chromosome 1, and with m = IcM, the 
average fraction of sharing over all pairs is w 4.4%, lead- 
ing to an estimated population size of N « 2200. The SD 
of the cohort-averaged sharing is 0.52%, higher than the 
SD in all 500 populations we simulated with a constant 
size N (typically 0.34%, maximum 0.41%). The recent 
history of Ashkenazi Jews, however, has likely involved 
bottlenecks and expansions, different from the constant 
size assumption. In [10], a population model was inferred 
based on the fraction of the genome shared at different 
segment lengths. The model's best estimate of AJ history 
is a slow expansion until about 35 generations ago, and 
then a severe bottleneck (effective population size of just 
270) followed a by rapid expansion to a current size of a 
few millions. As can be seen in Figure 10A and B, the 
model agrees well with the distribution of the fraction of 
total sharing over all pairs, but predicts a much narrower 
distribution of cohort-averaged sharing than the true one. 
Here too, in none of 100 simulated populations with the 
inferred demography was the SD of the cohort-averaged 
sharing as large as in the real data. These results, there- 
fore, suggest (based on the AJ population alone) that the 
AJ population was the target of a recent gene flow. To 
confirm that the increase in the variance of the cohort- 
averaged sharing is due (at least partly) to admixture, we 
ran an admixture analysis (Admixture [52]) comparing 
AJ to HapMap's CEU [31]. As can be seen in Figure 10C, 
the fraction of "AJ ancestry" is indeed highly correlated 
with the cohort-averaged sharing (Pearson's r = 0.59). 
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FIG. 10: IBD sharing and admixture in the Ashke- 
nazi Jewish (AJ) population. We detected IBD shared 
segments using Germline in chromosome 1 of n = 500 AJ 
individuals and compared to simulations of the demographic 
history inferred in [10]. A The distribution of the total shar- 
ing over all pairs. B The distribution of the cohort-averaged 
sharing. While the demographic model fits well the sharing 
distribution over all pairs, the distribution of the real cohort- 
averaged sharing is broader than in the model. C We used 
Admixture to calculate the admixture fraction of AJ indi- 
viduals compared to the CEU population. The "AJ ances- 
try fraction" of each individual is plotted against its cohort- 
averaged sharing. This panel shows results for the full dataset 
(« 2600 individuals). 
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III. DISCUSSION 

The recent availability of dense genotypes, together 
with sophisticated detection tools, have transformed IBD 
sharing into an increasingly important tool in population 
genetics. Here, we used coalescent theory to compute 
the variance and other properties of the total sharing 
in the Wright-Fisher model. For the variance, we sug- 
gested three derivations, one of which was more coarse 
but had a simple closed form that was later extended 
to populations of variable size. Investigating the cohort- 
averaged sharing, we discovered the curious phenomenon 
of 'hyper-sharing'. We showed how this can be exploited 
to improve power in imputation and association studies. 
We also calculated the variance of the total sharing be- 
tween siblings, and briefly considered some implications 
to the accuracy of demographic inference. We finally in- 
vestigated IBD sharing in a hybrid population and sug- 
gested a test for admixture based on the cohort-averaged 
sharing, which we then applied to the Ashkenazi Jewish 
population. We provide Matlab routines for the main 
results. 

Most of our analytical results depend on certain as- 
sumptions and simplifications, as specified in the indi- 
vidual sections and in Supplementary Section SI. 2. Ad- 
ditionally, in reality, the Wright-Fisher model and the 
coalescent are only approximations of the true ancestral 
process, and procedures such as phasing, IBD inference, 
and imputation are also prone to error. IBD detection 
errors will particularly affect our results for imputation 
and association studies (Section II C), and these results 
should therefore be considered as idealized upper bounds. 
The error model we introduced, where each IBD segment 
is missed with a certain probability, gives a sense of the 
effect of errors. Investigation of more detailed models, 
e.g., length dependent error rate for segment misdetec- 
tion or more realistic models for imputation and associ- 
ation studies, is challenging and left for future work. 

Prospects of our work are in a few fields. First, as 
shown in [10], theoretical characterization of IBD shar- 
ing can lead to new methods for demographic inference, 
which are expected to perform particularly well when in- 
vestigating the recent history of genetic isolates. Here, 
we expanded the theory of IBD sharing to compute the 
variance of the total sharing and the cohort-average shar- 
ing. This turned useful, for example, when we provided 
in Section II D 1 expressions for the variance of an esti- 
mator of the population size based on the average sharing 
over all pairs of chromosomes and in Section II D 3, where 
we described a test for recent admixture. In another do- 
main, understanding the distribution of sharing between 
relatives can improve the accuracy of relatedness detec- 
tion (Section II D 2). Other potential applications are in 
the detection of regions either positively selected or asso- 
ciated with a disease based on excess sharing, although 
more work is needed for these. Finally, our results pro- 
vide the first estimate for the potential success of impu- 
tation by IBD strategies (Section II C). We note that of 



course, once a given cohort has been genotyped, IBD can 
be calculated directly to estimate the expected success of 
imputation. However, in many cases, study design takes 
place before the actual recruiting and genotyping, and 
then, if a rough estimate of the population size is avail- 
able, our results can be invoked to estimate the amount 
of resources needed. 

One of our interesting findings was the presence of 
hyper-sharing individuals. While we did not define the 
term precisely, we referred to the fact that even for large 
cohorts, the variance of the cohort-averaged sharing does 
not decrease below a certain value. This result, while 
somewhat counterintuitive, follows naturally from the 
population model. In the real population of Ashkenazi 
Jews (AJ), we showed that the distribution of the cohort- 
averaged sharing is even broader, indicating possible ad- 
mixture, and indeed, we found that the cohort-averaged 
sharing is highly correlated with the Ashkenazi ancestry 
fraction. This is not to say that admixture was the only 
factor shaping the distribution of IBD sharing; other fac- 
tors such as selection or population substructure could 
have been playing a role as well. Our results, however, 
emphasize the importance of reconstructing the AJ de- 
mography simultaneously with that of their neighboring 
populations. 



IV. METHODS 

Coalescent simulations. — To simulate IBD sharing in 
the Wright-Fisher model, we used the Genome haploid 
coalescent simulator [30]. Recombination in Genome is 
discretized to short blocks and mutations (which we ig- 
nore in this study) are placed on the simulated branches. 
In all simulations, we generated one chromosome with 
recombination rate of 10 -8 per generation per base-pair 
and block lengths of 10 4 base-pairs (corresponding to res- 
olution of O.OlcM in the lengths of the shared segments). 

IBD sharing in simulations. — We used an add-on to 
Genome that returns, for each pair of chromosomes, 
the locations of all shared segments [10]. In that add- 
on, a segment is shared as long as the two chromosomes 
share the same ancestor, even if there was a recombina- 
tion event within the segment. We calculated, for each 
pair, the total length of shared segments longer than m, 
and divided by the chromosome size. For Figures 2-6, we 
simulated -/V pop > 100 populations and n = 100 haploid 
sequences in each population, and calculated all prop- 
erties of the total sharing among all N pop (™) available 
pairs. For the cohort-averaged sharing, we averaged, for 
each of the n chromosomes, their sharing to each of the 
other n— 1 chromosomes in the cohort, and then used the 
Ap p7i calculated values to obtain the variance and the 
distribution. Details on the simulation of an admixture 
pulse can be found in Supplementary Section S4. 

The Ashkenazi Jewish (A J) cohort. — The cohort we 
analyzed was previously described in [10, 51]. Briefly, 
DNA samples from « 2600 AJ were genotyped on 
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Illumina-IM SNP array. Genotypes (autosomal only) 
were subjected to quality control, including removal of 
close relatives, and phasing (Beagle [53]), leaving finally 
« 741,000 SNPs for downstream analysis. IBD sharing 
was calculated using Germline [2] with the following 
parameters: bits: 25, err_hom: 0, errjiet: 2, min_m: 
1, h_extend: 1. The results presented in Section II D 3 
remained qualitatively the same even when we used a 
longer length cutoff of m = 5cM. 

Admixture analysis. — For the admixture analysis, we 
merged the HapMap3 CEU population (Utah residents 
with ancestry from Northern and Western Europe) [31] 
(release 2) with the AJ data, removed all SNPs with po- 
tential strand inconsistency, and pruned SNPs that were 
in linkage disequilibrium [5]. We then ran Admixture 
[52] with default parameters and K — 2. Admixture 
consistently classified all individuals according to their 
population (CEU/AJ). Genome- wide, the AJ ancestry 
fraction was « 85%, compared to ~ 3% for the CEU 
population. Principal components analysis (SmartPCA 
[54]) gave qualitatively similar results. 

Simulations of A J demography. — Demographic recon- 



struction of the AJ population was performed in [10] us- 
ing chromosome 1 of 500 randomly selected individuals 
and using a novel IBD-based method described therein. 
Simulations presented here were performed using the fi- 
nal set of inferred demographic parameters: ancestral 
(diploid) population effective size of w 2300 individuals, 
expansion starting 200 generations ago reaching w 45000 
individuals 33 generations ago, a severe bottleneck of 
w 270 individuals, and an expansion to the current size 
of « 4.3 million individuals. Simulation of one hundred 
populations was carried out using Genome [30]. 
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Extended analytical results 



SI The variance of the total sharing 

Sl.l The expected number of recent mutations on a shared segment 

Consider a segment shared IBD between two individuals. Regardless of the segment length, the two indi- 
viduals are expected to differ in w 1 site along the segment. This is because for a pair of individuals with 
MRCA g generations ago, the shared segment is of typical length 100/(2g) cM (see, e.g., Mean total sharing 
section in the main text). The number of recent mutations per cM is 2gn, where fj, is the mutation rate per 
generation per cM. The total number of differences is therefore approximately 



For the human genome, /i « 10 8 per generation per bp [1], or w 0.8 • 10 2 per generation per cM (1MB 
corresponds roughly to 1.25cM). The number of difference is therefore around 1. 

SI. 2 The assumptions underlying derivation of the variance of the total sharing 

We summarize below the assumptions made when calculating the mean and the variance of the total sharing 
(main text Mean total sharing and The variance of the total sharing sections) . 

1. The population is Wright-Fisher with constant (effective) size N. We do not distinguish between male 
and female history, and all present-day individuals are represented as random pairs of haploids from 
the current generation. 

2. The ancestral process is described by Kingsman's coalescent [2]; specifically, time is assumed to be 
continuous, and the distribution of coalescence times is exponential with rate 1. 

3. Recombination is a Poisson process with rate 0.01 per cM. 

4. The recombination rate between markers is proportional to the genetic distance between the them. 

5. The markers are equally spaced, in genetic distance, along each chromosome and are dense enough, 
that when calculating the probability that a segment has length > m, we can ignore the discreteness 
of the markers. 

6. If two sites are on different chromosomes, they are shared or not independently of each other. 

7. Boundary effects at the ends of the chromosomes are ignored. 

8. We assume that the events that two sites are in shared segments are independent once we specify the 
time to the MRCA at each site. 

Assumptions 1, 2, 3, and 4 are standard when studying finite, isolated populations [2]. Assumption 5 
should present no problem in practice, with SNP arrays covering over a million sites or with whole-genome 
sequences. For assumption 6, we can, approximately, expect segments on different chromosomes to be shared 
independently of each other if the individuals are sufficiently unrelated that the average number of segments 
shared genome- wide is less than one, which is true for 4th (half-) cousins or less related individuals [3]. 
Assumption 7 is reasonable when L > ra (L is the length of the chromosome, m is the minimal segment 
length). 



# differences 



100 

~2g 



2g(i = 100/i. 



(1) 
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For the last assumption (8), one may suggest that if there was no recombination event in the history of 
two sites, then they are not independent. The reason why our approximation works is that when the two sites 
have the same coalescence time, it is usually very short (otherwise there would have been a recombination 
event and the coalescence times would not be the same in the two sites), increasing the probability that 
they lie on shared segments. If the sites have different coalescence times, the times would tend to be longer, 
reducing the probability that the sites are on shared segments, in accordance with the fact that they were 
separated by a recombination event. 

One importance of the derivation presented in the main text is that it sets the framework for a more 
detailed calculation that eliminates the last assumption. It does so by conditioning the probability 7T2(si, S2) 
on whether or not there was a recombination event. For each case, it then proceeds using the Markov chain 
representation of coalescent with recombination. This is explained in the next subsection. 

SI. 3 An alternative calculation of the variance of the total sharing 

In this subsection, we recalculate the probability ^(si, S2) = ^(k) of two sites separated by k markers to 
be both on shared segments of length > m. We use the Markov chain illustrated in Figure 1 of the main text 
as well as other notation as used in the main text. As mentioned above, we calculate %2 by conditioning on 
whether or not the two sites have been separated by a recombination event, 



where p ur is the probability of no recombination, 7r nr is the probability of both sites to be in shared segments 
when there was no recombination, and 7r r is the probability of both sites to be in shared segments when 
there was recombination. 

To calculate the probability of no recombination, we consider the discrete time Wright-Fisher model 
(as we found that it matches better the discrete-time simulations). In discrete time, the PDF of g, the 
number of generations to the (single-site) MRCA, is geometric, P(g) = ^ (l — jj) 9 . Given an MRCA at 
generation g at one site, we require that there was no recombination between that site and the other site, in 
both chromosomes, and in all g generations. Because recombination is a Poisson process and the distance 
between the sites is d = k -h , there will be no recombination with probability 



The scaled recombination rate p was defined as in the main text as p = 2A r d/100 [4]. 

Consider now the no-recombination probability, 7r nr . As long as d > m, 7r nr is trivially 1. If d < m, the 
segment spanning the two sites is of length d + £i£ +I2R, where 1\l is the distance to the next recombination 
event to the left of the left marker, and similarly for Iir (see Figure 1 for illustration). Given that the 
coalescence time (at both sites) was t, both I^l and Iir are exponentially distributed with rate 2Nt/\QQ. 
The PDF of the coalescence time is $(£) = (1 + p)e~ < - 1+p ) t , since this is the PDF of the time to exit state 1, 
and we are given that there was no recombination before coalescence. Therefore, 



7T2 = PnrTTnr + (1 ~ Pnr)^i 



(2) 




(3) 





These integrals are easily solvable, giving 





It is easy to see that lim^. 



.+ 7T, 



lim d _ >m - 7r nr , as expected. 
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Figure 1: An illustration of the shared segments spanning two sites (numbered 1 and 2). Lines correspond to 
chromosomes and circles to sites, which are distance d apart. The shaded boxes correspond to hypothetical 
shared segments. The left segment extends to distance £ir to the right of the site and £\l to the left of it, 
and similarly for the right segment. 



The case of recombination is more complicated. One might think that if there was a recombination 
event in the history of the two sites, then the two sites will be shared or not independently. However, the 
presence of a recombination event implies that the sum of £m and i^L [(the segment length to the right of 
the left marker) and (the segment length to the left of the right marker)] cannot exceed d (see Figure 1 for 
illustration) . We simplify the analysis by assuming instead that each of those two segments cannot exceed 
length d, but that their lengths are otherwise independent, resulting in a slight overestimation of tt t . Thus, 
for a given time to MRCA, t\, the segment length spanning the left site can be written as l\ = £\l + £%r 
(see Figure 1), where 1\l is distributed exponentially with rate Nti/50, 



NI-: I, 



P(h L ) = ^-e ^ : f [L (). m 



and i\R is similarly distributed, except for an upper cutoff at 1\r = d, 



Nt 
50 



P{tm) = - m wTTir ; 0<l 1R <d. (7) 



Using convolution, the probability density function of L\ = 1\l + t\R is 



1 m 



2 iVtjl! 



pm = "»' \Z • ; £ ;< d : (8) 

1 - e — 55- [^i «i > d. 
The probability that t\ > m and thus the site is on a shared segment is 

1 [d^Sre — ^ d<m 

Nt 
50 



l — e —gh (l + m^J-)e so - e so d > m. 



For large d, P(l\ > rri) — > (l + m^ L ) e so , which is exactly the single-site expression (Eq. (1) in the 
main text), as expected. We then simplify again by approximating the denominator of P(l\ > m) with 1, 

( Mi — Nt l m 

P{l 1 >m)*>\ 50 ' (10) 

[ (1 + e so — e so d> m. 

This should lead to a slight underestimation of 7r r . From here on the calculation is exact. An equation 
identical to (10) holds for P{li > m). Integrating the probabilities of the two sites to be in shared segments 
over all possible coalescence times, we have, for d < m, 

Nt-I «!tn NU Wt 2 m 

§{h,t 2 )d-^e ^d-^e S6~ dt x dt 2 . (11) 
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For d>m, 

poo poo r / 

7Tr = J J $(tl,i 2 ) u + 

As in the main text, this can be rewritten naturally in terms of the Laplace transform of $ 

pOO pOO 

S(?i,<Z2)= / / e-^-^Qih^dhdti. 
Jo Jo 

After some algebra, we find, for d < m, 



dhdt 2 - (12) 



(13) 



7T r = d 



drn[~d^2 V 50 ' 50 J 



mi=m 
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For d> m, 
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(15) 



We are therefore left only with finding <I>(gi, g 2 ). This can be carried out almost as in the main text, except 
that we must take into account that there was recombination before coalescence, that is, the Markov chain 
jumped from the initial state 1 to state 2 and not to state 8. Therefore, the coalescence times at the two 
sites, t\ and t 2 , can be seen as a sum of t' , the time it took to jump from state 1 to state 2, and the times 
it took from state 2 until coalescence events occurred in both sites. As we explained just before Eq. (4), 
the time it takes to jump from state 1 to state 2, given recombination, is distributed exponentially with rate 
(1 + p). Therefore, 



$(t 1 ,t 2 )dt 1 dt 2 = < 



' r o 4l (l + p)e- ( - 1 +P^' P 21 (h - t')6(t 2 - t x )dt'dt x dt 2 h = t 2 , 

J* 1 (1 + pje-a+rif [P 22 ( h - f ) + P 23 (h - t')\ e-to-Vtfdttdh h < t 2 , 
f t2 (l + p)e-<- 1 +P) t ' [P 22 (t 2 - t') + P 23 (t 2 - t')\ e-^-^dt'dhdh t 2 < t!. 



(16) 



In the last equation, P 2 i(t) is the probability of the chain to be at state i at time t, given that it started 
at state 2. The reasoning behind the last equation is as follows. In the case t\ = t 2 , to coalesce at both 
sites at time t 1 , we need to wait time t' to jump to state 2, then be back in state 1 after another period of 
(ti — t') (probability P 2 i(t\ — t')), and then jump to state 8 (probability dt\). To coalesce at site 1 (the left 
one) only at time t\, we need to wait time t' to get to state 2, and then be at state 2 (or 3) at time (t\ — t') 
(probability P 22 (ti — f) or P 23 (ti — f)) and jump to state 5 (or 7; probability dt\). Then, coalescence at site 
2 (the right one) at time t 2 > t\ occurs with probability e - ^ 2- * 1 ' dt2- The case t\ > t 2 is similarly explained. 
Taking the Laplace transform of the last equation, 

poo poo pt\ 

*(<Zi,«a)=/ / e-" 1 ' 1 " 92 * 2 / (l + p)e-^ t 'P 21 (t 1 -t')6(t 2 -t 1 )dt'dt 1 dt 2 (17) 
Jo Jo Jo 

f 1 (1 + p)e-^ t ' [P^ih - t') + P 23 (h - t')] e-^-^dt'dhdh 
Jo 

/ t2 (1 + p)e- {1 +^' [P 22 (t 2 - t') + P 2?> (t 2 - t')] e-^-^dt'dhdh. 
Jo 



OO pOO ft\ 
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lo Jt 
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Jo Jt 2 
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The first term of the right-hand- side can be solved as follows, 

pOO pOO ft 



(1 + P) 

l + P 



/ \ {l + p)e-^ t 'P 2 i(t 1 -t')dt'dt 1 5(t2-t 1 )dt 2 

Jo Jo 

poo r ftx 

/ e -(9i+»)«i / e -a+p)*p 21 (t 1 _t')df' 
Jo L</o 



1 + P + Qi + 12 



dh = 

P 2 l{qi+q 2 ). (18) 



The last line results from the special structure of the integrals in the second line: the internal integral is a 
convolution between e~ < - 1+p ^ lt and P 2 i(t), and the external integral is the Laplace transform t\ — ¥ (gi + q 2 ) 
of the internal integral. Applying the convolution theorem (recalling that the Laplace transform of e~ at is 
(a + g) -1 ), we arrive at the last line. The second and third terms of Eq. (17) require more algebra but are 
solved similarly, finally giving 

$(gi,g2) = : - 1 r hP , {(t-T— + T^) \P*2(qi + to) + ?2a(qi + to)] + Ai(«i+«4)}. (19) 
1 + p + qi + q 2 l\l + qx 1 + q 2 J L J J 

By that we are almost done, since as in the main text, the Laplace transform of the transition probabilities 
P 2 i{q) can be readily found using the continuous-time Markov chain relation 

P7i(q) = (qI-Q) 2i L , (20) 

where Q is the transition rate matrix of the chain. Substituting, using Mathematica, Eq. (20) in Eq. (19) 
gives 

g, (1 + P ){2(6 + g )[3 + gi(4 + gi) + g 2 (4 + g 2 ) + 3gxg 2 ] + p(2 + g )(13 + 3g) + p 2 (2 + g)} 

l9l,92j (l + gi )(l + g2 )(l + g + p)[2(l + g)(3 + g)(6 + g)+p(2 + g)(13 + 3g) + p2( 2 + g )] ' 1 j 

where g = gi + q 2 - We then substituted, again using Mathematica, Eq. (21) in Eqs. (14) and (15) to 
obtain the final expression for 7r r . We verified numerically that lim d ^ m + 7r r = lim^^- 7r r . Eq. (5) for 7r nr , 
Eq. (3) for p nr , and Eq. (2) for n 2 complete the derivation. 

SI. 4 An alternative derivation of &(qi,q 2 ) using the Feynman-Kac formula 

In this subsection, we show how $(gi, q 2 ) (Eq. (11) in the main text and Eq. (21) here) can be derived using 
the Feynman-Kac formula as described by Fitzsimmons and Pitman [5]. We thank an anonymous reviewer 
for pointing out this approach. 

Let us start with Eq. (11) in the main text. Assume the same continuous-time Markov chain as in the 
main text, and define a functional of the Markov chain as A v = J Q v(X t )dt, where X t is the state of the 
chain at time t, T is the "killing" time when the chain reaches an absorbing state (in our case, state no. 
8), and v(x) assigns a value to each state. With this notation, the Laplace transform &(qi,q 2 ) (for the case 
analyzed in the main text, when there is no restriction on the first transition) can be written as 



poo poo 

5(91,92)=/ / e-^-^ t ^(t 1 ,t 2 )dt 1 dt 2 = (e A ^}, 
Jo Jo 



(22) 



with v = — (gi + q 2 , q\ + q 2 , q\ + q 2 , gi, q 2 , gi, q 2 ) T ■ This is true, because the left-site coalescence time t\ is 
the total time spent by the chain in states 1,2,3,4, and 6, whereas the right-site coalescence time t 2 is the 
total time spent in 1,2,3,5, and 7. 

According to the Feynman-Kac formula [5], 

$(gi,g 2 ) = (e Av ) = \(Q' + m v )- 1 Q'i, (23) 
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where M v = diag(v), A is the initial condition (in our case, A = (1,0,0,0,0,0,0), since the chain always 
starts at state 1), and 1 = (1, 1, 1, 1, 1, 1, 1) T . The matrix Q' is obtained from the transition rate matrix Q 
by removing the row and column corresponding to the absorbing state (state 8). Carrying out the necessary 
matrix multiplications and inversions, we obtain the exact same expression as in Eq. (11) in the main text. 

In the case analyzed in Section SI. 3 above (leading eventually to Eq. (21)), the chain is guaranteed 
to jump from state 1 to state 2 (but not to state 8) at rate (1 + p). This can be incorporated into the 
Feynman-Kac framework by extending the chain to include a "ghost" state 0, from which the only outward 
transition is to state 2, at rate (1 + p). No transitions are allowed into state 0, and it is the initial state 
of the chain. Since neither site has coalesced while in state 0, we can write $(^1,92) = ( eAv ) with v = 
-(qi + q2,qi + q2,qi + q2,qi + q2,qi,q2,qi,q2) T - We then use (e A ") = \{Q" + M v )~ l Q"l, where A = 
(1, 0, 0, 0, 0, 0, 0, 0) and Q" is equal to Q' , but with an additional row and an additional column for the new 
state 0: 

/ -1- p 1 + p 0\ 



Q" 



V 

Solving and simplifying gives Eq. (21). 



Q' 



(24) 



SI. 5 A linearly expanding population 

In this subsection we calculate the mean and the variance of the total sharing for a linearly expanding 
population. Define the population size as N(t) = N X(t), where 



A(t) 



l + r(t -t) 0<t<t a , 
1 t > t . 



(25) 



This corresponds to a population maintaining a constant size until t — t generations ago; starting at t = to 
and until present, the population grows linearly at rate f. The PDF of the coalescence times is 



*(*) = - 

Substituting X(t) from Eq. (25), we have, for t < t a , 



rt dt' 
JO Aft 7 ) 



(26) 



1 



^ )= l + ^o-0 6XP 

= l+4o-O eXP ln 



dt' 



l + f(to-f) 
l + r(t -t) 



1 + fto 

= (l + rto)- 1/f [l + r(to-t)] 1 l f -\ 



(27) 



For t > to, 



$(<) = exp 



exp 



dt' 



1 + f (to - t 



-If] 



V— 

f \l+rt 



(t-to) 



= (l + rt y L/r e 
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(28) 
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In summary, 



* (t )Hi+«.)-'" tor*"- 1 :^'». m 



We then use Eq. (17) from the main text for the mean total sharing, 



</t>=/ $(t)[l + ^je-^dt, (30) 



50 



and Eq. (19) from the main text for the variance of the total sharing, 



/>1 r />oc 

Var [f T ] « 2 / (1-x) ${t) e - txNoL / r '°dt 

Jm/L [JO 



dx. (31) 



The integral in (fx) and the internal integral (over t) in Var [/y] can be evaluated in terms of incomplete 
Gamma functions (not shown) . For Var [fx] , the external integral must be evaluated numerically. [We also 
tried to change the order of the integration in Eq. (31), that is, to compute the integral over x first. However, 
in that case, while the integral over x was solvable, the integral over t was not.) We compare the results of 
Eqs. (30) and (31) to simulations in Figure SI. In the simulations, the ancestral population size was set to 
N a = 10000, the expansion started E t = 500 generations ago, and the final (current) population size varied 
in the range N c = [10500, 15000]. In terms of the parameters of \(t), this corresponds to N = N a = 10000, 
t = 500/10000 = 0.05, and r between (1.05 - l)/0.05 = 1 and (1.5 - l)/0.05 = 10. The comparison shows 
reasonable agreement with deviation of up to about 10%. 

S2 The distribution of the total sharing 

This section provides some additional results and discussion on The total sharing distribution and an error 
model section in the main text, in which an approximation to the distribution of the total sharing was 
presented. 

S2.1 A bound on the probability of no sharing 

A bound on the probability of no sharing, P(fx = 0), can be obtained directly from the one-sided Chebyshev 
inequality, 

P(fT<{f T )-a)< ^^2- (32) 

/t 

Substituting a = (fx) and noting that P(fx < 0) — P(fx = 0) immediately gives 

Hfr = 0)< 2 °/ T 2 . (33) 
v) T + </t) 

In practice, however, this bound is not very tight, as can be seen in Figure S3. 



S2.2 IBD calculations in the founder model 

The total sharing distribution and an error model section in the main text presented results for the distribu- 
tion of total sharing assuming it is a sum of a Poisson distributed number of segments. Early calculations of 
the distribution of the total sharing were performed in a different population model, where a group of unre- 
lated individuals is assumed to have recently founded the population. The distribution of the total length of 
the IBD shared segments was calculated, under somewhat strong assumptions, using renewal theory [6, 7]. 
In their model, it was assumed that if a region is not shared IBD, it is fully heterozygous (because it is 
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derived from different founders). In reality, however, all segments descend from a common ancestor at some 
point in the past, but the common ancestor of some segments is so ancient that they are too short to be 
detected. Our coalescent-based approach takes just that into account, by considering as IBD only segments 
longer than a certain length threshold. 

S2.3 Matching the Poisson and exponential parameters 

The parameters of the Poisson approximation, Eq. (24) in the main text, can be obtained by matching the 
first two moments of the total sharing distribution. The mean and variance of the Poisson approximation 
are given by (see, e.g., the main text Eq. (25)) 

(L T )=n (e + m)=L(f T ), 

Var [Lt] = n [£ 2 + (£ + m) 2 ] = L 2 a% , (34) 

where (fx) is given in the main text Eq. (4) and <jf T is given by one of the previously calculated approxi- 
mations, e.g., the main text Eq. (15). Solving for n n and £ in terms of (fx) and a f T gives 

_ Lo% -2{f T )m + a 

£ ° ~ Hfr) ' 

Laj T +2(f T )m- a 

2m*/L ' (35) 

where a = ( 4 (fx) Lma 2 T + L 2 aj T — 4 (fx) m 2 \ . In practice, we found that using Eq. (35) matched 

well the distribution P(fx) only when we underestimated cr/ T by 20-30%, probably because of the absence 
of the broad tail in Eq. (24). Therefore, in Figures 4 in the main text and S2 here we used the fitted values 
of no and £q. 

S3 An estimator of the population size 

In this subsection, we derive Eq. (39) in the main text for the variance of an estimator of the population 
size that is based on the average sharing between all pairs in a cohort. For a cohort of size n, define 

^=E™=iE;> i /^' ) /©,or 

fr = ^ . (36) 

The estimator takes the form 

A^iE-I^. (37) 



mf; 



T 



rn 



The SD of N can be approximated as in the main text, 



^ m / . \ 2 ' 

, JT 



(38) 



In fact, this approximation is better justified here than in the main text, as the distribution of fx is much 
narrower than that of fx- Using ^/ry = (It) ~ 100/(miV) gives 

mN 2 a= 
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We therefore need to calculate the variance of fx, from which we will then obtain the standard deviation 
a=. The variance of fx can be written as 

JT 



Var [fx] = var term + cov term, 



(40) 



where the var term corresponds to the variances of the individual terms in the sum in the definition of fx 
(Eq. (36)), and the cov term corresponds to the covariances of these terms. More concretely, using Eq. (36), 



var term = 



2a 2 

ST 



2-100 / L 
n 2 NL 11 m 



where we used Eq. (15) in the main text for a 2 T . The covariance term is 



cov term = 



Cov 



Aid) f {k,l)] 

JT tJT 



(41) 



(42) 



Note that the set (i,j,k,l) must have at least three distinct indexes. In most combinations of (i,j,k,l), 
we will have all i,j,k,l different, for which we assume that the covariance Cov ''\ /^'^j is zero. We 

therefore have to consider only covariances of the form Cov fx '*^J and Cov [Vr J \ /r '^j • Since for 

each pair (from which we have (")) there are (n — 2) possible ks, we have 



cov term : 



(»)2(n - 2)Cov [f™, f™] 4Cov [f™,f™] 4 . 1000 



n 



nN 2 mL ' 



where we used Eq. (27) in the main text for Cov \f T ' , f 



f(l,2) ,(1,3) 



Var [f 7 



2 • 100 / L \ 4 • 10000 _ 400 
n 2 NL n [m) + nN 2 mL ~ nNL 



In total, the variance of fx is 

L - 



Mi) + i«o 

2n Nm 



and 



Finally, 



20 



h y/nNL V 2n 



In ( 



100 

Nm' 



mN 2 a. 



100 

which is precisely Eq. (39) in the main text. 



5vnL 



2n 



+ 



100 

Nm' 



(43) 



(44) 



(45) 



(46) 



S4 An admixture pulse 

In the main text, an approximate solution was given for the integral in Eq. (43). The full solution is: 



Var [fx] « 2 f (1 - x) e~ t - txNL l hQ dt dx + 2a 2 f ( 

Jm/L JO Jm/L 

2 



100 



L 2 N 2 T, 



(l-x) 



' mf L 



e -t-txNL/S0 dt 



dx 



50(1 - a 2 ) 



T a (50 + NL)\ ( T a (50 + Nm)\ 
I -„ J " exp ( ,,, J 



50 



+T a (50 + NL) In + T a (50 + NL)(1 - a 2 ) 



50 

T a {50 + Nm) 
50 



NT a (L-m) 

r ( T a (50 + NL) ^ 



50 



(47) 
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where Ei(x) is the exponential integral function. To obtain the simplified equation (43) of the main text, 
we assumed T„«l (or G a = NT a < TV), m <C L, mNT a <C 50, mN > 50, and LNT a > 50, and used the 
series expansion of the exponential integral. For the parameters for which we plotted the simulation results, 
the simplified expression deviates in no more than 1% from the full expression. 

Simulations for the case of pulse admixture were performed using Genome as described in the main 
text, with the following population history. The initial (current) population size was set to N , followed by 
splitting to two populations, at generation G a , of relative sizes Na/(1 — a) and N, such that a fraction 
a of the linages descends from the first population (we could not find a way to implement the gene flow 
in Genome while keeping the population size fixed). At the next generation, the first population size was 
reduced back to N, and the second was increased to 10 6 , to practically eliminate IBD sharing within the 
second population. At generation 10 4 , the two populations were merged into a single population of size N, to 
enable all linages to coalesce. Simulation results are presented in Figure S10A. Each data point corresponds 
to 500 runs. The apparent noise for large a might be due to this somewhat unnatural admixture model 
implementation. 

References 

[1] P. D. Keightley. Rates and fitness consequences of new mutations in humans. Genetics, 190:295-304, 
2012. 

[2] J. Wakeley. Coalescent Theory: An Introduction. Roberts & Company Publishers, Greenwood Village, 
Colorado, USA, 2009. 

[3] A. Gusev, J. K. Lowe, M. Stoffel, M. J. Daly, D. Altshuler, J. L. Breslow, J. M. Friedman, and I. Pe'er. 
Whole population, genome-wide mapping of hidden relatedness. Genome Res., 19:318-326, 2009. 

[4] R. R. Hudson. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol., 
23:183-201, 1983. 

[5] P. J. Fitzsimmons and J. Pitman. Kac's moment formula and the Feynman-Kac formula for additive 
functionals of a Markov process. Stock. Proc. AppL, 79:117-134, 1999. 

[6] P. Stam. The distribution of the fraction of the genome identical by descent in finite random mating 
populations. Genet. Res., 35:131-155, 1980. 

[7] N. H. Chapman and E. A. Thompson. A model for the length of tracts of identity by descent in finite 
random mating populations. Theor. Pop. Biol, 64:141-150, 2003. 

[8] Y. Shen, R. Song, and I. Pe'er. Coverage tradeoffs and power estimation in the design of whole-genome 
sequencing experiments for detecting association. Bioinformatics, 27:1995-1997, 2011. 



S. Carmi et al. 



11SI 



File S2 



Supplementary Code 



MATLAB code for the main results 
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m=1(cM), L=100(cM), Na=10000, Et=500 (gen) 




Figure SI: Simulation results for a linearly expanding population. Simulation results (symbols) are shown 
for the mean and the standard deviation (SD) of the total sharing vs. the current population size N c for a 
linearly expanding population (ancestral population size N a = 10000 until time E t = 500 generations ago, 
then a linear expansion until the indicated current size). The theoretical curves are taken from Eq. S30 for 
the mean and Eq. S31 for the SD, along with Eq. S29 for the coalescence time PDF, $(£). The integrals 
were evaluated (analytically wherever possible; see File SI) in Mathematica. 
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Figure S2: Fitting the distribution of the total sharing. (A) and (B) The fitted values of the compound 
Poisson parameters: rip (A), the average number of segments, and £q (B), the parameter of the shifted 
exponential distribution of the segment lengths (£q + m is the average segment length). The parameters, 
which appear in the approximate distribution of the total sharing, Eq. (24) in the main text, are plotted vs. 
N. Data correspond to Figures 4A and B in the main text. The figure shows that n roughly decreases as 
1/7V, while £ decreases for small N but then approaches a constant. (C) Same as Figure 4A in the main 
text, but magnified and plotted in log-scale. The fitted line, corresponding to the compound Poisson (Eq. 
(24) in the main text), provides a good fit to the central part of the curve, but it predicts a right tail much 
narrower than actually observed. 
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Figure S3: The upper bound on the probability of no sharing. Simulation results (symbols) are plotted 
for the fraction of pairs in the Wright-Fisher population that did not share even a single segment of length 
> m. Lines correspond to the theoretical upper bound, Eq. S33. (A) The probability of no sharing vs. the 
population size N (cf. Figure 2A in the main text). (B) The probability of no sharing vs. the chromosome 
size L (Figure 2C in the main text). 
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Figure S4: The standard deviation (SD) of the cohort- averaged sharing. Simulation results for a-j^, the 
SD of the cohort-averaged sharing (in percentage of the genome) vs. the cohort size n. The different curves 
correspond to (top to bottom): N = 1000, 2000, 4000, 8000, 16000. Lines correspond to Eq. (28) of the main 
text. Squares: the SD of the cohort-averaged sharing within each cohort of n = 100 individuals, averaged 
over 100 realizations of the simulations. Circles: for comparison, the data of Figure 6A of the main text, 
where the cohort-averaged sharing from all realizations and all individuals was first pooled, and only then 
the SD was calculated. For small n, the average over all realizations gives a smaller variance than when 
pooling, but is otherwise in agreement with the prediction. The agreement is likely because as long as n is 
not too small, the ancestral processes seen by different individuals in the cohort are only weakly correlated, 
and therefore the variance as calculated in the main text (over all ancestral processes) gives the correct 
result. 
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Figure S5: The distribution of the cohort-averaged sharing. (A) The joint distribution of the 3-way total 
sharing P (f^\ fx '^^j ■ To investigate whether the sharing fractions between two individuals to a third one 
depend on each other, we simulated the total sharing in populations with N = 10000, m = lcM, n = 100, 
and one chromosome of length L = 278cM. For each population, we recorded all distinct values of fj, and 
fj,' and plotted their joint histogram (after binning). The dependence is weak, but cannot be rejected 
based on a x 2 -test of independence (P- value 0.12). (B) A QQ-plot of the distribution of the cohort-averaged 
sharing. Simulation results correspond to Figure 6B in the main text. Briefly, we calculated the distribution 
of the cohort-averaged sharing for populations with N = 20000 individuals and one chromosome of size 
L = lOOcM. The minimal segment length was m = 0.5cM and the cohort size was n = 500. A QQ-plot of 
the distribution is shown, comparing the empirical distribution to a normal one. The distribution is quite 
close to normal in the central part, but with a broader right tail and a narrower left tail than expected. 
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Figure S6: A histogram of the number of pairs sharing at each locus. We simulated 100 Wright-Fisher 
populations with N = 10000, n = 100, and one chromosome of length L = 278cM, and searched for IBD 
shared segments using m = lcM. In the Genome coalescent simulator, recombination is resolved only 
within blocks whose size we set to O.OlcM. For each such block (excluding the first and last m(cM) of 
the chromosome), we recorded the number of pairs sharing a segment containing it, and then plotted the 
histogram over all blocks. We also plot a Poisson PDF with the same mean as the observed distribution. The 
histogram is significantly broader than the Poisson (Index-of-Dispersion test P-value less than Matlab's 
resolution), indicating that sharing tends to concentrate at specific loci. 
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Figure S7 : Power to detect an association when imputing by IBD. We plot the power to detect an association 
of a variant that exists in cases only, with and without imputation by IBD, and with sequenced individuals 
selected either randomly or according to their total sharing. This corresponds to the model of Implications to 
sequencing study design section in the main text. The parameters we used were: TV = 10000, L = 278cM (one 
chromosome), m = lcM, cohort size of n c = 500 cases and n t = 500 controls, and a total sequencing budget 
of n s = 100 individuals. The carrier frequency here is f3 = 0.02, and the threshold P-value is Q = 0.01. 
For each number of sequenced cases (x-axis), n c , s (where due to the budget limit, the number of sequenced 
controls is Tits — n s ~ n c,s), we plot the power according to Eqs. (32), (34), (35), and (36) in the main 
text. The power vs. n c s has a sawtooth shape. This was also documented in [8], where the same model 
was analyzed. The sawtooth is an effect of the discreteness of the model. For several different values of n CjS 

(or n'c^, for that matter), the minimal number of carriers b* required to reject the null hypothesis is the 
same, but the probability to observe that number of carriers increases with n CiS . As n c s increases further, 
b* finally increases by one, reducing the power dramatically. 
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Figure S8: The mean and the variance of an estimator of the population size vs. N. We plot simulation 
results (symbols) for the estimator of the population size, N, given in Eq. (38) in the main text. For each 
value of N, we simulated a number of Wright-Fisher populations and calculated the total sharing as in Figure 
2A in the main text. For each of the populations simulated for each n, we divided the individuals into four 
disjoint groups of 25 individuals each. In each group, we calculated the mean total sharing, /, between all 
( 2 2 5 ) pairs. We then applied the main text Eq. (38) to calculate the population size estimator N. Finally, for 

each N, we plotted the average of the estimator over all groups, \NJ (A), as well as its standard deviation 

(B). In (A), we also plot the identity line — N), and in (B), we also plot the theory, the main text 

Eq. (39). 
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Figure S9: The mean and the variance of an estimator of the population size vs. n. This figure is as Figure 
S8, except that here the mean (A) and standard deviation (B) of the estimator A^ are plotted vs. the number 
of individuals n and the population size is fixed, N = 10000. For each n, the total sharing between all pairs 
in a subset of n individuals from each population was averaged to obtain /, and then N was calculated 
according to Eq. (38) in the main text. Panel (A) also shows a horizontal line at ^A^ — N, demonstrating 

that N is overestimated, but only for small n. Panel (B) also shows the theoretical standard deviation from 
the main text Eq. (39). 
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Figure S10: IBD sharing in an admixture pulse model. (A) Simulations were carried out for the admixture 
pulse model as described in the Supplementary text (File SI). The mean and the standard deviation (SD) 
of the total fraction of IBD sharing are plotted. Symbols correspond to simulations (triangles: mean, 
circles/squraes: SD). Solid lines correspond to the main text Eq. (42) for the mean sharing and dashed 
lines to the main text Eq. (43) for the SD. Blue and red symbols/lines correspond to G a = 5 and G a = 10, 
respectively. The magenta dashed line corresponds to the theoretical mean sharing if admixture has just 
occurred, G a = 0. (B) P-values for the admixture test. We simulated the admixture pulse model with 
population size of N = 10000, L = 150cM, m = lcM, G a equals 5 or 10, and various values of a. For 
each G a and a, the (true) IBD shared segments were extracted and the population size was inferred as 

N = W0/(mfa) — 75/m, where fa is the average fraction of sharing over all pairs. Then, 500 populations 
were simulated with constant size N, and the SD of the cohort-averaged sharing was calculated. The P-value 
is the fraction of times the SD in the simulations was higher than the one in the admixed population. 
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Figure Sll: The effect of possible confounders on the admixture test. (A) The effect of a variable population 
size. We simulated a simple two-size population history, with ancestral population size N a until T = 50 
generations ago, followed by population size of N c = 10000 until present. N a varied between 1000 and 
20000, such that both expansions and contractions were studied. We simulated two scenarios: one without 
admixture and one with admixture taking place G a = 5 generations ago, replacing a proportion 1 — a = 0.3 
of the population. We than ran the admixture test as described in the main text and in Figure S10 (with 
100 simulated constant-size populations). The results demonstrate that for all values of N a tested, while 
for the no-admixture case, the test always resulted in an insignificant P-value, for the admixture case, 
the P-value was always below 0.05. We note, however, that it might be that a more extreme or complex 
demographic history will confound the admixture test; but at least for the parameters investigated here, 
the admixture test is robust. (B) The effect of IBD detection errors. We simulated populations of constant 
size N and dropped each detected IBD segment with probability e = 0.2 (as in the error model of The total 
sharing distribution and an error model section of the main text). Again, we simulated two scenarios: with 
and without admixture (same parameters as in (A)). We then ran the admixture test, and as in (A), the 
resulting P-values were significant only for the truly admixed populations. 
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